
    Design and Implementation of MPICH2 over InfiniBand with RDMA Support

    Full text available
    For several years, MPI has been the de facto standard for writing parallel applications. One of the most popular MPI implementations is MPICH. Its successor, MPICH2, features a completely new design that provides greater performance and flexibility. To ensure portability, it has a hierarchical structure that allows porting to be done at different levels. In this paper, we present our experiences designing and implementing MPICH2 over InfiniBand. Because of its high performance and open standard, InfiniBand is gaining popularity in the area of high-performance computing. Our study focuses on optimizing the performance of MPI-1 functions in MPICH2. One of our objectives is to exploit Remote Direct Memory Access (RDMA) in InfiniBand to achieve high performance. We have based our design on the RDMA Channel interface provided by MPICH2, which encapsulates architecture-dependent communication functionalities into a very small set of functions. Starting with a basic design, we apply different optimizations and also propose a zero-copy-based design. We characterize the impact of our optimizations and designs using microbenchmarks. We have also performed an application-level evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2 implementation achieves 7.6 μs latency and 857 MB/s bandwidth, which are close to the raw performance of the underlying InfiniBand layer. Our study shows that the RDMA Channel interface in MPICH2 provides a simple, yet powerful, abstraction that enables implementations with high performance by exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is the first high-performance design and implementation of MPICH2 on InfiniBand using RDMA support. Comment: 12 pages, 17 figures
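    The RDMA Channel abstraction mentioned above is easiest to picture as a handful of put/read operations over pre-registered buffers. The sketch below is illustrative only: the names and signatures are hypothetical stand-ins, not MPICH2's actual RDMA Channel API.

```c
/*
 * Hypothetical sketch of a minimal RDMA-channel-style interface.
 * The names and signatures below are illustrative, NOT the actual
 * MPICH2 RDMA Channel API; they only show how a very small set of
 * put/read operations over pre-exchanged, registered buffers can hide
 * architecture-dependent details from the rest of an MPI library.
 */
#include <stddef.h>
#include <sys/types.h>

typedef struct rdma_conn rdma_conn_t;    /* opaque per-peer connection  */

typedef struct {
    void  *base;                         /* registered buffer address   */
    size_t len;                          /* number of bytes             */
} rdma_iov_t;

/* Establish/tear down a connection and exchange buffer credentials. */
int rdma_channel_init(rdma_conn_t **conn, int peer_rank);
int rdma_channel_finalize(rdma_conn_t *conn);

/* Write a vector of buffers into the peer's pre-registered memory.
 * Returns the number of bytes accepted; the caller retries the rest. */
ssize_t rdma_channel_put_datav(rdma_conn_t *conn,
                               const rdma_iov_t *iov, int iov_count);

/* Poll for data the peer has written into our pre-registered memory. */
ssize_t rdma_channel_read_datav(rdma_conn_t *conn,
                                rdma_iov_t *iov, int iov_count);
```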

    Memory Registration Caching Correctness

    No full text
    Fast and powerful networks are becoming more popular on clusters to support applications including message passing, file systems, and databases. These networks require special treatment by the operating system to obtain high throughput and low latency. In particular, application memory must be pinned and registered in advance of use. However, popular communication libraries such as MPI have interfaces that do not require explicit registration calls from the user; thus, the libraries must manage registration themselves.
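    The registration management described here is usually handled with a registration cache. The sketch below, which assumes a generic reg_mem() call standing in for a real NIC registration primitive, shows the basic lookup-or-register path and hints at the correctness issue the title refers to: a cached entry becomes stale if the application frees or remaps the region behind it and the cache is not invalidated.

```c
/*
 * Minimal sketch of a memory-registration cache.  reg_mem() is an
 * illustrative stub for a real registration call (e.g. a verbs-style
 * memory-region registration); names and structures are not taken from
 * the paper.  The correctness hazard: if the application frees or
 * remaps a cached range, the cached registration silently refers to the
 * wrong physical pages unless the cache is invalidated (for example by
 * intercepting free/munmap).
 */
#include <stdlib.h>
#include <stdint.h>

typedef struct reg_entry {
    uintptr_t         start, end;   /* registered virtual address range */
    void             *handle;       /* NIC registration handle          */
    struct reg_entry *next;
} reg_entry_t;

static reg_entry_t *cache_head;

extern void *reg_mem(void *addr, size_t len);   /* pin + register (stub) */

/* Return a registration covering [addr, addr+len), reusing one if cached. */
void *reg_cache_lookup(void *addr, size_t len)
{
    uintptr_t start = (uintptr_t)addr, end = start + len;

    for (reg_entry_t *e = cache_head; e != NULL; e = e->next)
        if (e->start <= start && end <= e->end)
            return e->handle;                   /* cache hit: no pinning */

    reg_entry_t *e = malloc(sizeof(*e));
    e->start  = start;
    e->end    = end;
    e->handle = reg_mem(addr, len);             /* slow path: pin pages  */
    e->next   = cache_head;
    cache_head = e;
    return e->handle;
}
```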

    A Performance Analysis of the Ammasso RDMA Enabled Ethernet Adapter and its iWARP API

    No full text
    Network speeds are increasing well beyond the capabilities of today's CPUs to efficiently handle the traffic. This bottleneck at the CPU causes the processor to spend more of its time handling communication and less time on actual processing. As network speeds reach 10 Gb/s and more, the CPU simply cannot keep up with the data. Various methods have been proposed to solve this problem. High-performance interconnects, such as InfiniBand, have been developed that rely on RDMA and protocol offload in order to achieve higher throughput and lower latency. In this paper, we evaluate the feasibility of a similar approach which, unlike existing high-performance interconnects, requires no special infrastructure. RDMA over Ethernet, otherwise known as iWARP, facilitates the zero-copy exchange of data over ordinary local area networks. Since it is based on TCP, iWARP enables RDMA in the wide area network as well. This paper provides a look into the performance of one of the earliest commodity implementations of this emerging technology, the Ammasso 1100 RNIC.
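    For context, latency figures for adapters like the Ammasso 1100 typically come from a ping-pong microbenchmark. The sketch below shows only the measurement structure; rdma_write() and wait_for_data() are hypothetical placeholders, not the adapter's actual iWARP API.

```c
/*
 * Sketch of the kind of ping-pong microbenchmark used to measure RDMA
 * latency.  rdma_write() and wait_for_data() are hypothetical
 * placeholders for whatever RDMA API the adapter exposes; only the
 * timing structure is shown.
 */
#include <stdio.h>
#include <stddef.h>
#include <sys/time.h>

extern void rdma_write(const void *buf, size_t len);  /* placeholder */
extern void wait_for_data(void *buf, size_t len);     /* placeholder */

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    enum { ITERS = 10000, MSG = 8 };
    char buf[4096];

    double t0 = now_us();
    for (int i = 0; i < ITERS; i++) {
        rdma_write(buf, MSG);       /* send a small message to the peer */
        wait_for_data(buf, MSG);    /* wait for the peer's reply        */
    }
    double t1 = now_us();

    /* One-way latency is half the average round-trip time. */
    printf("latency: %.2f us\n", (t1 - t0) / ITERS / 2.0);
    return 0;
}
```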

    Simulation studies of Gigabit ethernet versus Myrinet using real application cores

    No full text
    Parallel cluster computing projects use a large number of commodity PCs to provide cost-effective computational power to run parallel applications. Because properly load-balanced distributed parallel applications tend to send messages synchronously, minimizing blocking is as crucial a requirement for the network fabric as are high bandwidth and low latency. We consider the selection of an optimal, commodity-based interconnect network technology and topology to provide high bandwidth, low latency, and reliable delivery. Since our network design goal is to facilitate the performance of real applications, we evaluated the performance of Myrinet and Gigabit Ethernet technologies in the context of working algorithms, using modeling and simulation tools developed for this work. Our simulation results show that Myrinet behaves well in the absence of congestion. Under heavy load, its latency suffers due to blocking in the distributed wormhole routing scheme. Conventional Gigabit Ethernet switches cannot scale to support more than 64 Gigabit Ethernet ports today, which leads to the use of cascaded switches. Bandwidth limitations in the inter-switch links and extra store-and-forward delays limit the aggregate performance of this configuration. The Avici switch router uses six 40 Gbps internal links to connect individual switching nodes in a wormhole-routed three-dimensional torus. Additionally, the fabric's large speed-up factor and its per-connection buffer management scheme provide for non-blocking deliveries under heavy load.
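    The store-and-forward penalty of cascaded Ethernet switches, versus the single serialization delay of a cut-through or wormhole fabric, can be seen with a back-of-the-envelope calculation; the numbers below are illustrative and are not taken from the paper's simulations.

```c
/*
 * Back-of-the-envelope illustration (not from the paper's simulations)
 * of why cascaded store-and-forward switches add latency: each hop must
 * receive the full frame before forwarding it, so the serialization
 * delay is paid once per hop, whereas cut-through/wormhole routing pays
 * it essentially once end to end.
 */
#include <stdio.h>

int main(void)
{
    const double frame_bits = 1500.0 * 8.0;   /* full-size Ethernet frame   */
    const double link_bps   = 1e9;            /* Gigabit Ethernet link rate */
    const int    hops       = 3;              /* e.g. two cascaded switches */

    double per_hop_us = frame_bits / link_bps * 1e6;   /* about 12 us/hop */
    printf("store-and-forward: %.1f us over %d hops\n",
           per_hop_us * hops, hops);
    printf("cut-through:       %.1f us (serialized once)\n", per_hop_us);
    return 0;
}
```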

    Abstract

    No full text
    Interconnect speeds currently surpass the abilities of today's processors to satisfy their demands. The throughput rate provided by the network simply generates too much protocol work for the processor to keep up with. Remote Direct Memory Access (RDMA) has long been studied as a way to alleviate the strain on the processors. The problem is that, until recently, RDMA was limited to proprietary or specialty interconnects that are incompatible with existing networking hardware. iWARP, or RDMA over TCP/IP, changes this situation. iWARP brings all of the advantages of RDMA, but is compatible with existing network infrastructure, namely TCP/IP over Ethernet. The drawback to iWARP until now has been the lack of hardware capable of matching the performance of specialty RDMA interconnects. Recently, however, 10 Gigabit iWARP adapters have begun to appear on the market. This paper demonstrates the performance of one such 10 Gigabit iWARP implementation and compares it to a popular specialty RDMA interconnect, InfiniBand.

    Distributed Queue-based Locking using Advanced Network Features

    No full text
    A Distributed Lock Manager (DLM) provides advisory locking services to applications such as databases and file systems that run on distributed systems. Lock management at the server is implemented using First-In-First-Out (FIFO) queues. In this paper, we demonstrate a novel way of delegating lock management to the participating lock-requesting nodes, using advanced network primitives such as Remote Direct Memory Access (RDMA) and atomic operations. This nicely complements the original idea of DLM, in which management of the lock space is distributed. Our implementation achieves better load balancing, reduced server load, and improved throughput over traditional designs.
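    One way to picture delegated, queue-based locking over one-sided operations is a ticket lock kept in the server's registered memory and manipulated only by clients. The sketch below is illustrative rather than the paper's exact protocol; rdma_fetch_add() and rdma_read_u64() are hypothetical placeholders for the network's atomic and read verbs.

```c
/*
 * Illustrative sketch (not the paper's exact protocol) of how one-sided
 * RDMA atomics can provide FIFO locking without involving the server's
 * CPU: a ticket counter lives in the server's registered memory, and
 * clients use remote fetch-and-add / read operations to take a turn and
 * wait for it.  rdma_fetch_add() and rdma_read_u64() are hypothetical
 * placeholders for the network's atomic and read verbs.
 */
#include <stdint.h>

/* Placeholders for one-sided operations against the server's memory. */
extern uint64_t rdma_fetch_add(uint64_t remote_addr, uint64_t delta);
extern uint64_t rdma_read_u64(uint64_t remote_addr);

/* Remote addresses of the two counters that make up one lock. */
typedef struct {
    uint64_t next_ticket_addr;   /* incremented by acquirers  */
    uint64_t now_serving_addr;   /* incremented on release    */
} dlm_lock_t;

void dlm_acquire(const dlm_lock_t *lk)
{
    /* Atomically take a ticket; FIFO order falls out of the counter. */
    uint64_t my_ticket = rdma_fetch_add(lk->next_ticket_addr, 1);

    /* Spin (or back off) until it is our turn; only RDMA reads are
     * issued, so the server's CPU stays out of the critical path. */
    while (rdma_read_u64(lk->now_serving_addr) != my_ticket)
        ;
}

void dlm_release(const dlm_lock_t *lk)
{
    /* Pass the lock to the next waiter in ticket order. */
    rdma_fetch_add(lk->now_serving_addr, 1);
}
```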